
Collaborating Authors

 bakshi and kothari


Conditional Linear Regression for Heterogeneous Covariances

arXiv.org Machine Learning

Linear regression is a standard technique in statistics and data analysis. The task in standard linear regression is to fit a linear relationship among the variables in a data set. Often, the goal is to find the most parsimonious model that describes the majority of the data. In this work, we consider the situation where only a small portion of the data can be accurately modeled by linear regression. More generally, in many kinds of real-world data, sizable portions of the data can be predicted significantly more accurately than the best linear model for the overall data distribution allows. Rosenfeld et al. (2015) showed that some attributes are significant risk factors for gastrointestinal cancer in certain subpopulations but not in the overall population. Hainline et al. (2019) demonstrated that a variety of standard real-world regression benchmarks have portions that are fit significantly better by a different linear model than by the best model for the overall data set; Calderon et al. (2020) presented further, similar findings. We consider cases where a linear model fits well once the data set is restricted by a simple condition that is unknown to us. We study the task of finding such a linear model together with a formula on the data attributes describing the condition, i.e., the portion of the data for which the linear model is accurate. This problem was introduced by Juba (2017), who gave an algorithm for conditional sparse linear regression that uses the maximum residual as the objective.
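To make the task concrete, the following is a minimal sketch on synthetic data, not the algorithm of Juba (2017) or of this paper. It assumes that candidate conditions are single Boolean attributes, fits each candidate subset by ordinary least squares, and scores the fit by its maximum absolute residual. The function names, the `min_fraction` threshold, and the data-generating process are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_and_max_residual(X, y):
    """Ordinary least-squares fit with an intercept.

    Returns the coefficients and the largest absolute residual.
    """
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, float(np.max(np.abs(A @ coef - y)))

def best_single_attribute_condition(X, y, B, min_fraction=0.1):
    """Brute-force search over conditions of the form B[:, j] == 1.

    Illustrative assumption: the condition is one Boolean attribute and
    must cover at least `min_fraction` of the data.
    """
    best = None
    for j in range(B.shape[1]):
        mask = B[:, j] == 1
        if mask.mean() < min_fraction:
            continue
        coef, err = fit_and_max_residual(X[mask], y[mask])
        if best is None or err < best[2]:
            best = (j, coef, err)
    return best

# Synthetic data: y is linear in x only where Boolean attribute 1 holds;
# elsewhere y is unrelated noise.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 1))
B = rng.integers(0, 2, size=(n, 3))
y = rng.normal(scale=5.0, size=n)
inside = B[:, 1] == 1
y[inside] = 3.0 * X[inside, 0] + 1.0 + rng.uniform(-0.1, 0.1, size=inside.sum())

_, overall_err = fit_and_max_residual(X, y)
j, coef, cond_err = best_single_attribute_condition(X, y, B)
print(f"overall max residual:     {overall_err:.3f}")
print(f"condition B[:, {j}] == 1, max residual: {cond_err:.3f}, coef: {coef}")
```

In this toy setting the model fit to the whole data set has a large maximum residual, while the model fit to the subset selected by the planted attribute has a small one. Richer condition classes (for example, formulas over several attributes) make exhaustive search impractical, which is why the problem calls for dedicated algorithms.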